De-identification Methods for Open Health Data: The Case of the Heritage Health Prize Claims Dataset
Abstract
BACKGROUND There are many benefits to open datasets. However, privacy concerns have hampered the widespread creation of open health data. There is a dearth of documented methods and case studies for the creation of public-use health data. We describe a new methodology for creating a longitudinal public health dataset in the context of the Heritage Health Prize (HHP). The HHP is a global data mining competition to predict, by using claims data, the number of days patients will be hospitalized in a subsequent year. The winner will be the team or individual with the most accurate model past a threshold accuracy, and will receive a US $3 million cash prize. HHP began on April 4, 2011, and ends on April 3, 2013.
OBJECTIVE To de-identify the claims data used in the HHP competition and ensure that it meets the requirements in the US Health Insurance Portability and Accountability Act (HIPAA) Privacy Rule.
METHODS We defined a threshold risk consistent with the HIPAA Privacy Rule Safe Harbor standard for disclosing the competition dataset. Three plausible re-identification attacks that can be executed on these data were identified. For each attack the re-identification probability was evaluated. If it was deemed too high then a new de-identification algorithm was applied to reduce the risk to an acceptable level. We performed an actual evaluation of re-identification risk using simulated attacks and matching experiments to confirm the results of the de-identification and to test sensitivity to assumptions. The main metric used to evaluate re-identification risk was the probability that a record in the HHP data can be re-identified given an attempted attack.
RESULTS An evaluation of the de-identified dataset estimated that the probability of re-identifying an individual was 0.0084, below the 0.05 probability threshold specified for the competition. The risk was robust to violations of our initial assumptions.
CONCLUSIONS It was possible to ensure that the probability of re-identification for a large longitudinal dataset was acceptably low when it was released for a global user community in support of an analytics competition. This is an example of, and methodology for, achieving open data principles for longitudinal health data.
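The risk metric described under METHODS can be illustrated with a minimal sketch. The abstract does not say how the re-identification probability was computed, so the code below assumes a common equivalence-class estimate: a record's risk is taken as 1 divided by the number of records sharing its quasi-identifier values. The field names (`age_group`, `sex`), the toy records, and the function names are hypothetical; only the 0.05 threshold comes from the abstract.

```python
from collections import Counter

# Probability threshold stated in the abstract for the competition dataset.
THRESHOLD = 0.05


def record_risks(records, quasi_identifiers):
    """Per-record risk, assumed here to be 1 / (equivalence class size).

    An equivalence class is the set of records that share the same values
    on all quasi-identifiers. This is an illustrative assumption, not
    necessarily the estimator used for the HHP dataset.
    """
    keys = [tuple(r[q] for q in quasi_identifiers) for r in records]
    class_sizes = Counter(keys)
    return [1.0 / class_sizes[k] for k in keys]


def average_risk(records, quasi_identifiers):
    """Mean of the per-record risks across the dataset."""
    risks = record_risks(records, quasi_identifiers)
    return sum(risks) / len(risks)


def meets_threshold(records, quasi_identifiers, threshold=THRESHOLD):
    """True if the average risk does not exceed the chosen threshold."""
    return average_risk(records, quasi_identifiers) <= threshold


# Hypothetical toy data: three records share one class, one record is unique.
toy_records = [
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "40-49", "sex": "F"},
    {"age_group": "80+", "sex": "M"},
]

if __name__ == "__main__":
    qi = ["age_group", "sex"]
    print(record_risks(toy_records, qi))
    print(average_risk(toy_records, qi))
    print(meets_threshold(toy_records, qi))
```

In this toy case the unique record dominates the average, so the dataset would fail the threshold. Under the same assumption, a de-identification step such as coarsening `age_group` would enlarge the equivalence classes and lower the estimated risk.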
Similar resources
Improving Fraud and Abuse Detection in General Physician Claims: A Data Mining Study
Background We aimed to identify the indicators of healthcare fraud and abuse in general physicians’ drug prescription claims, and to identify a subset of general physicians that were more likely to have committed fraud and abuse. Methods We applied a data mining approach to a major health insurance organization dataset of private sector general physicians’ prescription claims. It involved 5 ste...
An Adversarial Analysis of the Reidentifiability of the Heritage Health Prize Dataset
I analyze the reidentifiability of the Heritage Health Prize dataset taking into account the auxiliary information available online and offline to a present-day adversary. A key technique is identifying providers, which is useful both as an end in itself and as a stepping stone towards identifying members. My primary findings are: 1. Grouping providers based on shared members results in the for...
Integrating the Population Perspective into Health System Performance Assessment (IPHA): Study Protocol for a Cross-Sectional Study in Germany Linking Survey and Claims Data of Statutorily and Privately Insured
Background Health system performance assessment (HSPA) is a major tool for evidence-based governance in health systems and patient/population-orientation is increasingly considered as an important aspect. The IPHA study aims (1) to undertake a comprehensive performance assessment of the German health system from a population perspec...
A Case for Open Network Health Systems: Systems as Networks in Public Mental Health
Increases in incidents involving so-called confused persons have brought attention to the potential costs of recent changes to public mental health (PMH) services in the Netherlands. Decentralized under the (Community) Participation Act (2014), local governments must find resources to compensate for reduced central funding to such services or “innovate.” But innovation, even when pressure for c...
The Economic Burden of Stroke Based on South Korea’s National Health Insurance Claims Database
Background This study was conducted to determine the scale and the nature of the economic burden caused by strokes and to use the results as an evidential source for determining the allocation of South Korea stroke cases in 2015. Methods For research subjects, the study analyzed demographic characteristics and economic burden based on data from national health insurance (NHI) claims for inpat...